Abstract

In a provocative and influential paper, Jesse Rothstein (2010) finds that standard value-added models 
(VAMs) suggest implausible future teacher effects on past student achievement, a finding that obviously cannot 
be viewed as causal. This is the basis of a falsification test (the Rothstein falsification test) that appears to 
indicate bias in VAM estimates of current teacher contributions to student learning. More precisely, the 
falsification test is designed to identify whether or not students are effectively randomly assigned conditional on 
the covariates included in the model. 

Rothstein's finding is significant because there is considerable interest in using VAM teacher effect 
estimates for high-stakes teacher personnel policies, and the results of the Rothstein test cast considerable 
doubt on the notion that VAMs can be used fairly for this purpose. However, in this paper, we illustrate, 
theoretically and through simulations, plausible conditions under which the Rothstein falsification test rejects 
VAMs even when students are randomly assigned, conditional on the covariates in the model, and even when 
there is no bias in estimated teacher effects. 




Introduction 



In a provocative and influential paper, "Teacher Quality in Educational Production: Tracking, 
Decay, and Student Achievement," Jesse Rothstein (2010) reports that VAMs used to estimate the 
contribution individual teachers make toward student achievement fail falsification tests, which appears 
to suggest that VAM estimates are biased. More precisely, Rothstein shows that teachers assigned to 
students in the future have statistically significant power in predicting past student 
achievement, a finding that obviously cannot be viewed as causal. Rather, the finding appears to signal 
that student-teacher sorting patterns in schools are not fully accounted for by the types of variables 
typically included in VAMs, implying a correlation between omitted variables affecting student 
achievement and teacher assignments. Rothstein presents this finding (his falsification test) as evidence 
that students are not effectively randomly assigned to teachers conditional on the covariates in the 
model. 

Rothstein's falsification test has become a key method for academic papers to test the validity 
of VAM specifications. Koedel and Betts (2009), for instance, argue that there is little evidence of bias 
based on the Rothstein test when the VAM teacher effect estimates are based on teachers observed in 
multiple classrooms over time. In their analysis of the impacts of teacher training programs, Harris and Sass 
(2010) report selecting school districts in which the Rothstein test does not falsify. 
Briggs and Domingue (2011) say that they use the Rothstein test 
to critique value-added results that were produced for the Los Angeles public schools and later publicly 
released. 

Rothstein's finding has considerable relevance because there is great interest in using VAM 
teacher effect estimates for policy purposes such as pay for performance (Podgursky and Springer 2007; 
Eckert and Dabrowski 2010) or determining which teachers maintain their eligibility to teach after some 
specified period of time, such as when tenure is granted (Goldhaber and Hansen 2010a; Gordon et al. 
2006; Hanushek 2009). Many would argue that, if VAMs are shown to produce biased teacher effect 
estimates, it casts doubt upon the notion that they can be used for such high-stakes policy purposes. 1 
Indeed, this is how the Rothstein findings have been interpreted. For instance, in an article published in 
Education Week, Debra Viadero (2009) interprets the Rothstein paper to suggest that "'value-added' 
methods for determining the effectiveness of classroom teachers are built on some shaky assumptions 
and may be misleading." 2 As Rothstein (2010) himself said, the "results indicate that policies based on 
these VAMs will reward or punish teachers who do not deserve it and fail to reward or punish teachers 
who do." In a recent congressional briefing, Rothstein cited his falsification test results and said that 
value-added is "not fair to ... special needs teachers ... [or] other specialists." 3 

In this paper, we show that the Rothstein falsification test can reject when there is no bias. 
Specifically, we describe plausible conditions under which the Rothstein test rejects the null hypothesis 
of no impacts of future teachers on lagged achievement even when students are randomly assigned 
conditional on the covariates in the model and there is no bias in estimated teacher effects. We verify 
these conditions theoretically and through a series of simulations. Our findings are important because, 
as noted above, the Rothstein test is shaping not only academic studies but also public perception about 
the efficacy of utilizing VAMs. 



The Rothstein Falsification Test and Bias 

A. The Value-Added Model Formulation 

There is a growing body of literature that examines the implications of using VAMs in an attempt to 
identify causal impacts of schooling inputs and contributions of individual teachers toward student 



1 Other researchers have come to somewhat different conclusions about whether VAMs are likely to produce biased 
estimates of teacher effectiveness. Kane and Staiger (2008), for instance, find that certain VAM specifications 
produce teacher effect estimates that are similar to those produced under experimental conditions, a result supported 
by non-experimental findings of Chetty et al. (2011) who exploit differences in the estimated effectiveness of 
teachers who transition between grades or schools to test for bias. And while Koedel and Betts (2009) confirm 
Rothstein’s basic findings about single-year teacher effect estimates, they report finding no evidence of bias based 
on the Rothstein falsification test when the VAM teacher effect estimates are based on teachers observed in multiple 
classrooms over time. 

2 Others have cited his work in similar ways (Hanushek and Rivkin 2010; Rothman 2010; Baker et al. 2010; and 
Kraemer et al. 2011). 

3 Haertel et al. (2011). 
learning (for example, Chetty et al. 2011; Ballou et al. 2004; Goldhaber and Hansen 2010a; Kane and 
Staiger 2008; McCaffrey et al. 2004, 2009; Rothstein 2009, 2010; Rockoff 2004). Researchers typically 
assume their data can be modeled by some variant of the following equation: 

(1) $A_{ig} = \alpha + \lambda A_{i(g-1)} + \sum_t \beta_t \tau_{tig} + e_{ig}$ 

where $A_{ig}$ is the achievement of student i in grade g, 

$\alpha$ is the intercept and also the value added of the omitted teacher, 4 
$\lambda$ is the impact of lagged achievement, 

$\tau_{tig}$ is a dummy variable identifying if student i had teacher t in grade g, 

$\beta_t$ is the impact of teacher t compared to the omitted teacher, 5 and 
$e_{ig}$ is an error term that represents other factors that affect student learning. 6 

If $e_{ig}$ is uncorrelated with the other variables in equation 1, then the impact of teacher t can be 
estimated by regressing student achievement ($A_{ig}$) on prior student achievement ($A_{i(g-1)}$) and dummy 
variables identifying the teacher student i had in grade g ($\tau_{tig}$). 7 Of course, this model makes a number 
of assumptions about the nature of student learning; see, for instance, Harris et al. (2010), Rothstein 
(2010), or Todd and Wolpin (2003) for more background on these assumptions. 8 
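
As a minimal sketch of how equation 1 might be estimated, the Python snippet below simulates data under random assignment and recovers the teacher effects by ordinary least squares. The sample size, the true coefficient values, and the helper name `fit_vam` are illustrative assumptions rather than values or code taken from this paper.

```python
import numpy as np

def fit_vam(achievement, lagged, teacher, n_teachers):
    """OLS fit of equation 1: intercept, lagged achievement, and a dummy
    for every teacher except teacher 0 (the omitted teacher)."""
    dummies = (teacher[:, None] == np.arange(1, n_teachers)).astype(float)
    X = np.column_stack([np.ones(len(lagged)), lagged, dummies])
    coef, *_ = np.linalg.lstsq(X, achievement, rcond=None)
    return coef[0], coef[1], coef[2:]  # intercept, lambda, teacher effects

# Hypothetical data: four teachers, students assigned purely at random.
rng = np.random.default_rng(0)
n, n_teachers = 20_000, 4
true_alpha, true_lambda = 0.05, 0.9
true_beta = np.array([0.0, 0.1, -0.1, 0.2])  # effects relative to teacher 0
teacher = rng.integers(0, n_teachers, size=n)
lagged = rng.normal(0.0, 1.0, size=n)
error = rng.normal(0.0, 0.4, size=n)  # uncorrelated with the regressors
achievement = true_alpha + true_lambda * lagged + true_beta[teacher] + error

alpha_hat, lambda_hat, beta_hat = fit_vam(achievement, lagged, teacher, n_teachers)
print(lambda_hat, beta_hat)  # should be near 0.9 and [0.1, -0.1, 0.2]
```

Because the simulated error term is drawn independently of lagged achievement and teacher assignment, the estimated teacher effects should land close to the assumed true values.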

Rothstein questions whether standard VAMs produce unbiased estimates of teacher effect on 
student learning. 9 In particular, he describes a number of ways in which the processes governing the 



4 The intercept equals the value of the outcome when all other variables are set to 0. If the achievement scores 
(current and baseline) are mean centered, the intercept equals the outcome for a student with an average baseline 
score who has the omitted teacher. 

5 In the rest of this paper, we generally refer to $\beta_t$ as the impact of teacher t. This can be thought of as equivalent to 
saying that the results are normalized so that the impact of the omitted teacher is 0. 

6 This model is similar to the VAM2 model discussed by Rothstein (2010), which also allows the coefficient on 
lagged achievement to differ from 1 and excludes student-fixed effects. Value-added models often also include 
school and grade fixed effects and a vector of student and family background characteristics (for example, age, 
disability, English language status, free or reduced-price lunch status, race, ethnicity, whether a student has 
previously been retained in grade, and parental education). These factors could be incorporated into our models with 
no changes in the substantive findings. Rothstein (2010) also discusses student-fixed effects and measurement error, 
two issues that we address below. 

7 Like much of the value-added literature, Rothstein does not try to separate classroom and teacher effects. 

8 Note that, even if VAMs produce effect estimates that are unbiased, they may not be very reliable. For more on 
this, see Goldhaber and Hansen (2010b), McCaffrey et al. (2009), and Schochet and Chiang (2010). 

9 Parameter estimates can often be consistently estimated with large sample sizes but remain biased when sample 
sizes are small relative to the number of parameters being estimated. In this paper, we focus primarily on situations 
assignment of students to teachers may lead to erroneous conclusions about teacher effectiveness. Of 
particular concern is the possibility that students may be tracked into particular classrooms based on 
skills that are not accounted for by $A_{i(g-1)}$. 10 

B. Bias in VAM Teacher Effect Estimates 

If equation 1 is estimated using ordinary least squares, the estimated impacts of a teacher can be 
biased for a number of reasons. To derive a formula for this bias, we divide the error term ($e_{ig}$) into two 
components: $ov_{ig}$, which, if it exists, we assume to be correlated with at least some of the covariates 
included in the model even after controlling for the others, and $eu_{ig}$, which is uncorrelated with any of 
the covariates in the model. 

Thus, 

$e_{ig} = \gamma \, ov_{ig} + eu_{ig}$ 

where $\gamma$ is the coefficient on $ov_{ig}$. $\gamma$ is also the coefficient that would be obtained on $ov_{ig}$, were it added 
to equation 1. 

It should be noted that, by definition, if $ov_{ig}$ exists it would cause bias for at least some coefficient 
estimates. However, as we describe below, it can exist and not necessarily cause bias for the estimated 
teacher effects. 

The general formula for omitted variable bias for the effect of teacher t takes the form: 

$\mathrm{Bias}(\hat{\beta}_t) = E(\hat{\beta}_t) - \beta_t = \gamma \pi_{te}$ 

where 

$\hat{\beta}_t$ = estimate of $\beta_t$, and 

$\pi_{te}$ = coefficient on $\tau_{tig}$ from a regression of $ov_{ig}$ on all right-hand side variables in equation 1 (except $e_{ig}$). 

with large sample sizes. Kinsler (2011) investigates what happens to the Rothstein test when smaller sample sizes 
are used. 

10 This could happen for a number of reasons. For example, principals might assign students to teachers based in part 
on unobserved factors that impact current achievement but are not captured by lagged achievement. Principals might 
do this either because they believe that certain teachers have a comparative advantage in educating certain students 
or because the principals are rewarding certain teachers with choice classroom assignments. Similar results could 
hold if parents lobby to have their children assigned to particular teachers. 
It can be shown that 

(2) 11 $\pi_{te} = \mathrm{cov}(ov^*_{ig}, \tau^*_{tig}) / \mathrm{var}(\tau^*_{tig})$ 

where $ov^*_{ig}$ and $\tau^*_{tig}$ are the residual values of $ov_{ig}$ and $\tau_{tig}$ that remain after controlling for other 
variables in a linear regression model. 12 

This formulation is helpful because it shows that only the residual value of $ov_{ig}$ matters for bias in 
the estimated teacher effects. In other words, $ov_{ig}$ would not cause bias in the estimated teacher effects 
if its residual is uncorrelated with the residual of $\tau_{tig}$. 13 
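
The bias formula and equation 2 have exact in-sample analogues that can be checked numerically. The sketch below is a hedged illustration with one teacher dummy: the data-generating values, the tracking rule, and the helper name `ols` are assumptions made for the example, not part of this paper's simulations.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 5_000
lagged = rng.normal(size=n)                        # lagged achievement
ov = rng.normal(size=n)                            # omitted variable
tau = (ov + rng.normal(size=n) > 0).astype(float)  # teacher dummy, tracked on ov
y = 0.9 * lagged + 0.1 * tau + 0.5 * ov + rng.normal(scale=0.4, size=n)

ones = np.ones(n)
X_short = np.column_stack([ones, lagged, tau])     # VAM that omits ov
X_long = np.column_stack([X_short, ov])            # same model with ov added

beta_short = ols(X_short, y)[2]
coef_long = ols(X_long, y)
beta_long, gamma_hat = coef_long[2], coef_long[3]

# pi_te: coefficient on tau from regressing ov on all right-hand side variables.
pi_te = ols(X_short, ov)[2]

# Equation 2: the same coefficient computed from residualized variables.
Z = np.column_stack([ones, lagged])
ov_star = ov - Z @ ols(Z, ov)
tau_star = tau - Z @ ols(Z, tau)
pi_te_resid = (ov_star @ tau_star) / (tau_star @ tau_star)

print(beta_short - beta_long, gamma_hat * pi_te)   # equal up to rounding
print(pi_te, pi_te_resid)                          # equal up to rounding
```

In this example the teacher dummy is tracked directly on the omitted variable, so the residual correlation is far from zero and the short-model teacher effect is shifted by the product of the two coefficients.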

If one estimates a linear VAM, then the residual of $ov_{ig}$ would also be based on a linear regression. 
Consequently, another way to state the condition of obtaining unbiased teacher effects in the presence 
of an omitted variable in a linear VAM is as follows: if the omitted variable is a linear function of lagged 
achievement and the other control variables in the equation, then it need not cause bias in the 
estimated teacher effects because the partial correlation is picked up by the included variables. 

If the omitted variable is a nonlinear function of lagged achievement, then it is more likely to cause bias. 
However, even in this case, it may still be possible to obtain unbiased estimates of the teacher effects if 
the omitted variable is time-invariant (that is, $ov_{ig} = ov_i$). Tracking based on time-invariant factors 
can potentially be addressed through the inclusion of a student fixed effect, which 
captures time-invariant student or family background attributes that affect current achievement in ways 
not captured by previous achievement. 14 But one of the significant contributions of Rothstein's work is 
that he raises an additional concern about VAMs: he postulates that student sorting into classrooms is 



11 Equation 2 holds because each coefficient estimate from any linear regression can be obtained by regressing the 
residualized outcome on the residualized version of the corresponding right-hand side variable in that equation. Each 
residualized variable is equal to the residual obtained after regressing the original variable on all of the other right- 
hand side variables in the equation (Goldberger 1991). 

12 We use the abbreviation “cov” for “covariance” and “var” for “variance” in equations throughout the paper. We 
use “*” to indicate a residual from a linear regression of the variable in question on lagged achievement. We could 
also write $ov_{ig} = \pi_{xe} x_{itg} + ov^*_{ig}$, where $x_{itg}$ represents all the covariates in the model except $\tau_{tig}$ and $\pi_{xe}$ is a vector of 
coefficients on those variables. In our example, the vector $x_{itg}$ includes lagged achievement and all other teacher 
variables in equation 1. 

13 The variable $ov_{ig}$ could still cause bias in the estimated impacts of other covariates, like the lagged achievement 
variable, and therefore be considered an omitted variable for the regression. 

14 Studies differ on the efficacy of including student fixed effects: Harris and Sass (2010) argue for it, while Kane 
and Staiger (2008) argue against. Koedel and Betts (2009) find that student fixed effects are statistically insignificant 
in the models they estimate. 
"dynamic" in the sense that an omitted factor that affects the error term may be time-varying and 
correlated with placement into classrooms (that is, a form of tracking). This could lead to bias. He 
specifically suggests that the omitted factors cause the errors to be negatively correlated over time. 15 
This could be due to compensating behavior whereby students who have a good year (along unobserved 
lines) receive fewer inputs in the following year than students who are otherwise similar. 16 This dynamic 
form of tracking cannot be accounted for by the inclusion of a simple student fixed effect. 

C. Rothstein's Falsification Test 

Rothstein's falsification test relies on the notion that evidence of the statistical significance of future 
teachers in predicting past achievement suggests a misspecification of the VAM (Rothstein 2010). 
Similar falsification tests are often used in the economics literature (Heckman and Hotz 1989; 
Ashenfelter 1978). However, as discussed below, these tests are generally used to check the 
specification of models that differ in important ways from VAM. 

We focus our paper on the version of the Rothstein test described in footnote 13 of his paper, 
where Rothstein states that "...random assignment conditional on $A_{i(g-1)}$ will be rejected if the grade-g 
classroom teachers predict $A_{i(g-2)}$ conditional on $A_{i(g-1)}$" (this is described more formally in the next 
subsection). 17 Rothstein implements this test using a linear regression. For example, achievement in 
grade 3 can be regressed on grade 4 achievement and grade 5 teachers. The test is whether the 
coefficients on the grade 5 teachers are jointly significant. 
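
The joint significance test described above can be sketched as a standard F test that compares a regression with and without the teacher dummies. The snippet below is an illustrative implementation on simulated data; the function names, sample size, and the quartile-based tracking rule are assumptions for the example, and the p-value step (which needs an F distribution) is left out to keep the sketch dependent on numpy only.

```python
import numpy as np

def residual_ss(X, y):
    """Sum of squared OLS residuals from regressing y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def rothstein_f_stat(a_outcome, a_control, teacher, n_teachers):
    """F statistic for the joint significance of teacher dummies in a
    regression of earlier achievement on later achievement."""
    n = len(a_outcome)
    base = np.column_stack([np.ones(n), a_control])
    dummies = (teacher[:, None] == np.arange(1, n_teachers)).astype(float)
    full = np.column_stack([base, dummies])
    q = n_teachers - 1                 # number of restrictions tested
    df_resid = n - full.shape[1]       # residual degrees of freedom
    ssr_base = residual_ss(base, a_outcome)
    ssr_full = residual_ss(full, a_outcome)
    f_stat = ((ssr_base - ssr_full) / q) / (ssr_full / df_resid)
    return f_stat, q, df_resid

rng = np.random.default_rng(2)
n, n_teachers = 8_000, 4
a3 = rng.normal(size=n)                                # grade 3 achievement
a4 = 0.9 * a3 + rng.normal(scale=0.4, size=n)          # grade 4 achievement
teacher_random = rng.integers(0, n_teachers, size=n)   # grade 5 teacher, random
# Assumed tracking rule: grade 5 teacher set by quartile of the grade 3 score.
teacher_tracked = np.digitize(a3, np.quantile(a3, [0.25, 0.5, 0.75]))

f_random, q, df_resid = rothstein_f_stat(a3, a4, teacher_random, n_teachers)
f_tracked, _, _ = rothstein_f_stat(a3, a4, teacher_tracked, n_teachers)
print(f_random, f_tracked)
```

With three restrictions and a large sample, the 5 percent critical value of the F distribution is about 2.6. The random-assignment statistic would typically fall below that value, while the tracked statistic should be far above it.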



15 Rothstein’s evidence of a negative correlation of errors over time is derived from a VAM without student fixed 
effects. If important time-invariant omitted student factors exist, implying the need for student fixed effects, we 
would expect to see a positive correlation across grade levels between the errors in a model estimated without 
student fixed effects. Given this, it seems unlikely that student fixed effects explain Rothstein’s finding of bias. 

16 This could happen if family contributions to achievement are negatively correlated over time, conditional on past 
achievement. Alternatively, this could happen if the impacts of the error term decay at a different rate than other 
factors. By allowing the coefficient on lagged achievement to be less than one, the VAM we consider allows for 
decay, but implicitly assumes that the decay is the same for all factors (prior achievement, teachers, and the error 
term). 

17 It is important to note that the outcome in this equation is grade g-2 achievement. If one were to instead look at 
the impact of grade g teachers on grade g-1 achievement and control for grade g-2 achievement, then the nature of 
the test changes, although many of the same general properties hold. In particular, the test will often reject when 
there is no bias for estimated teacher effects. Results based on this test are available upon request. 
Rejecting the null of no future teacher effects might then be taken as evidence of bias. In other 
words, many people may assume it to imply a correlation between the independent variation in teacher 
assignments and the error in the achievement equation. But a more precise formulation of the 
necessary exclusion restriction for unbiased teacher effects is that current grade teacher assignments 
are orthogonal to all residual determinants of the current score that are not accounted for by the 
exogenous variables included as controls in the model. This does not, however, mean that future grade 
teacher assignments need necessarily be orthogonal to the residual determinants of past scores. 
Rothstein's falsification test does provide evidence of tracking and that tracking could cause VAM 
misspecification. But, as we show below, the Rothstein test will reject, even in the absence of any 
omitted variables that might lead to biased teacher effect estimates. 

D. Comparing the Rothstein Falsification Test to a Formal Condition for Bias 

The central connection between the Rothstein falsification test and a formal condition for bias is 
through the tracking of students. However, tracking of students to future teachers (implied by the 
Rothstein test) need not imply that students are tracked to current teachers. For example, students can 
be tracked into future teacher classrooms based, at least in part, on past achievement, even if they were 
randomly assigned to current teachers. For this reason, the Rothstein test only applies to bias for 
current teacher effect estimates if tracking systems are consistent across grades. Consistent tracking 
across grades also means that we can lag the Rothstein test one period and get the same results. We do 
this because it provides a means of concisely comparing the Rothstein falsification and bias tests. 

For simplicity, in comparing the Rothstein and bias tests in this section of our paper we utilize a 
simple VAM with only one school and two teachers in each grade (in Section III we show simulations 
with more complex data structures). The dummy for one teacher is omitted from the regression and the 
current grade teacher assignment depends entirely on lagged achievement. Thus: 
(3) 18 $A_{ig} = \lambda A_{i(g-1)} + \beta_{1g} \tau_{1ig} + e_{ig}$ 

where $\mathrm{cov}(e_{ig}, A_{i(g-1)}) = \mathrm{cov}(e_{ig}, \tau_{1ig}) = 0$. 

We specify a flexible functional form for tracking for teacher 1 (the one with higher value added): 

$\tau_{1ig} = T(A_{i(g-1)})$ 19 

Rothstein's falsification test is based on a regression of lagged achievement on current achievement 
and future teachers. 20 As noted earlier, to simplify our discussion we lag the Rothstein test one period so 
that instead of focusing on the relationship between future teachers and lagged achievement, the test is 
based on the relationship between current teachers and double-lagged achievement. This enables us to 
compare the conditions for bias and the Rothstein test using the same set of teachers. If the possible 
sources of bias are the same across grades, this simplification has no effect on our results. 

The Rothstein test using future teachers can be written as: 

$A_{i(g-1)} = \lambda_1 A_{ig} + R_{1(g+1)} \tau_{1i(g+1)} + w_{i(g-1)}$ 

where $R_{1(g+1)}$ describes the regression-adjusted relationship between future teachers and lagged 
achievement. 

When we lag this one period, we get 

(4) 21 $A_{i(g-2)} = \lambda_1 A_{i(g-1)} + R_{1g} \tau_{1ig} + w_{i(g-2)}$ 

The Rothstein test involves estimating whether or not $R_{1g}$ differs from 0. 

The numerator in $R_{1g}$ is the covariance between current teachers and the residual from a regression 
of double-lagged achievement on lagged achievement. 22 

(5) $A_{i(g-2)} = \lambda_2 A_{i(g-1)} + u_{i(g-2)}$ 



18 To simplify our presentation, we omit intercepts and measurement error from these equations. Adding them back 
does not change our substantive findings. Numerous methods exist to obtain unbiased estimates in the presence of 
such measurement error (Potamites et al. 2009; Meyer 1999; Fuller 1987). 

19 $T()$ is bounded between 0 and 1 and $\partial T() / \partial A_{i(g-1)} > 0$. 

20 In Table IV of his paper, he uses a similar test in which he adds in current teachers as an additional set of control 
variables in the model estimating future teacher effects. As shown in our simulations, this second test will also often 
reject when there is no bias. 

21 As shown earlier, footnote 13 of Rothstein’s paper uses the same grade levels: g, g-1, and g-2. 

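22 More precisely, $R_{1g} = \mathrm{cov}(\tau_{1ig}, A_{i(g-2)} | A_{i(g-1)}) / \mathrm{var}(\tau_{1ig} | A_{i(g-1)}) = \mathrm{cov}(\tau_{1ig}, u_{i(g-2)} | A_{i(g-1)}) / \mathrm{var}(\tau_{1ig} | A_{i(g-1)})$ since $u_{i(g-2)}$ is 
the residual that remains after regressing $A_{i(g-2)}$ on $A_{i(g-1)}$. 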
If $R_{1g}$ is 0, then $\lambda_2$ equals $\lambda_1$ and $u_{i(g-2)}$ equals $w_{i(g-2)}$. Thus, one can test to see if $R_{1g}$ differs from 0 using 
the following covariance: 

$\mathrm{cov}(\tau_{1ig}, u_{i(g-2)} | A_{i(g-1)}) \neq 0$ 23 

For comparison, here is the condition for obtaining biased estimates: 

$\mathrm{cov}(\tau_{1ig}, e_{ig} | A_{i(g-1)}) \neq 0$ 

Formulated in this way, the Rothstein falsification test and formal condition for bias appear quite 
similar but differ in a key respect: $e_{ig}$ is not the same as $u_{i(g-2)}$. One can, therefore, generate data in 
which there is no bias but the Rothstein test rejects (and vice versa). 

We break up the error term from equation 5 into three pieces to show factors that might cause the 
Rothstein test to falsify unbiased VAMs. Rearranging and substituting out for $A_{i(g-1)}$ yields: 

$u_{i(g-2)} = A_{i(g-2)} - \lambda_2 (\lambda A_{i(g-2)} + \beta_{1(g-1)} \tau_{1i(g-1)} + e_{i(g-1)})$ 

(6) $= (1 - \lambda_2 \lambda) A_{i(g-2)} - \lambda_2 \beta_{1(g-1)} \tau_{1i(g-1)} - \lambda_2 e_{i(g-1)}$ 24 

Equation 6 shows that $u_{i(g-2)}$ can be written as a linear function of three variables: $A_{i(g-2)}$, $\tau_{1i(g-1)}$, and 
$e_{i(g-1)}$. The Rothstein test is based on testing whether $\mathrm{cov}(\tau_{1ig}, u_{i(g-2)} | A_{i(g-1)}) \neq 0$, which means it will reject if 
any of the following three conditions hold: 



(C1) $\mathrm{cov}(\tau_{1ig}, A_{i(g-2)} | A_{i(g-1)}) \neq 0$ 

(C2) $\mathrm{cov}(\tau_{1ig}, \tau_{1i(g-1)} | A_{i(g-1)}) \neq 0$ 

(C3) $\mathrm{cov}(\tau_{1ig}, e_{i(g-1)} | A_{i(g-1)}) \neq 0$ 
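
The link between equation 6 and the three conditions can be made explicit by taking the partial covariance of each term in equation 6 with the current teacher dummy. The display below is a sketch of that step in the notation of this section; it relies only on the linearity of covariance and treats the coefficients as constants.

```latex
\begin{aligned}
\operatorname{cov}\bigl(\tau_{1ig},\, u_{i(g-2)} \mid A_{i(g-1)}\bigr)
  &= (1 - \lambda_2 \lambda)\,
     \operatorname{cov}\bigl(\tau_{1ig},\, A_{i(g-2)} \mid A_{i(g-1)}\bigr) \\
  &\quad - \lambda_2\, \beta_{1(g-1)}\,
     \operatorname{cov}\bigl(\tau_{1ig},\, \tau_{1i(g-1)} \mid A_{i(g-1)}\bigr) \\
  &\quad - \lambda_2\,
     \operatorname{cov}\bigl(\tau_{1ig},\, e_{i(g-1)} \mid A_{i(g-1)}\bigr)
\end{aligned}
```

Each term on the right-hand side corresponds to one of conditions C1 through C3, so the left-hand side can be nonzero only if at least one of the three partial covariances is nonzero.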



E. Exploring the Conditions That Cause the Rothstein Test to Falsify 

The first condition that would cause the Rothstein test to reject is that current teachers are 
conditionally correlated with double-lagged achievement, after controlling for lagged achievement in a 
linear regression. This is not implausible, as school systems may not have ready access to lagged 

23 Throughout, we use $\mathrm{cov}(x, y | A_{i(g-1)})$ to describe the covariance that remains between variables x and y after 
controlling for $A_{i(g-1)}$ using a linear regression. 

24 This equation is similar to Rothstein’s equation 7 except that it includes double-lagged achievement on the right- 
hand side. 
achievement scores at the point at which teacher assignment decisions are made. 25 Consequently, many 
schools may use double-lagged achievement for tracking decisions. 

The second condition that would cause the Rothstein test to reject occurs when current teachers are 
conditionally correlated with lagged teachers, after controlling for lagged achievement. This is likely if 
some classrooms disproportionately consist of students who shared the same classroom in the previous 
year, perhaps because schools intentionally keep certain students together (or apart). We refer to this 
as "classroom tracking." 26 

Conditions 1 and 2 both relate to the possibility that there is a variable left out of the VAM that 
affects tracking and that might, therefore, cause bias. However, if these variables (double-lagged 
achievement and lagged teachers) affect current achievement only through their impacts on lagged 
achievement, as is implied by equation 3, then they may cause the Rothstein test to reject, but their 
omission from the VAM will cause no bias. 
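
The first condition can be illustrated with a small simulation of the two-teacher model in equation 3. In the sketch below, assignment to teacher 1 depends on double-lagged achievement, while current achievement depends on it only through lagged achievement. The parameter values, the tracking rule, and the helper name `ols_with_se` are assumptions made for this example; intercepts are added to both regressions.

```python
import numpy as np

def ols_with_se(X, y):
    """OLS coefficients and conventional (homoskedastic) standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))

rng = np.random.default_rng(3)
n, lam, beta_1 = 20_000, 0.9, 0.1
a_g2 = rng.normal(size=n)                            # double-lagged achievement
a_g1 = lam * a_g2 + rng.normal(scale=0.4, size=n)    # lagged achievement
# Assumed tracking rule: a higher double-lagged score makes teacher 1 more likely.
tau_1 = (a_g2 + rng.normal(size=n) > 0).astype(float)
# Current achievement follows equation 3; its error is independent of tracking.
a_g = lam * a_g1 + beta_1 * tau_1 + rng.normal(scale=0.4, size=n)

X = np.column_stack([np.ones(n), a_g1, tau_1])
vam_coef, vam_se = ols_with_se(X, a_g)    # VAM: current on lagged and teacher
rt_coef, rt_se = ols_with_se(X, a_g2)     # lagged Rothstein test (equation 4)

beta_hat = vam_coef[2]
t_rothstein = rt_coef[2] / rt_se[2]
print(beta_hat, t_rothstein)
```

Under these assumptions the estimated teacher effect should sit close to the assumed value of 0.1, while the t statistic on the teacher dummy in the falsification regression should be very large, so the test rejects even though the VAM estimate is not biased.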

The third condition that would cause the Rothstein test to reject occurs when the current teacher is 
conditionally correlated with the lagged error term, after controlling for lagged achievement. This 
condition might seem unlikely to matter since $e_{i(g-1)}$ is part of $A_{i(g-1)}$. Thus, one might assume that 
controlling for $A_{i(g-1)}$ would account for a correlation between $e_{i(g-1)}$ and $\tau_{1ig}$. However, this turns out not 
to be the case because the falsification test is linear while both the current teacher and double-lagged 
achievement are nonlinear functions of lagged achievement and the nonlinearities can be correlated. 27 



25 For instance, Mathematica does value-added work in various states and localities where teacher effect estimates 
are needed in a timely way to inform key policy decisions. In many of these locations, state achievement score data 
from the spring of one school year are often not available until the fall of the following school year, too late to affect 
tracking decisions for that year. See, for example, Potamites et al. (2009) and Chaplin et al. (2009). 

26 There are alternative tests for bias caused by these types of tracking. For example, for the first condition, one can 
include double-lagged achievement in the VAM and test to see if the coefficient estimates on 
current teachers change compared to a model without that variable (Rothstein 2009). Similarly, for the second 
condition, one can add lagged teachers to a standard VAM. Results of this latter test are likely to be very imprecise 
for many teachers, especially in smaller schools, if most of their students come from a single lagged teacher. 

27 Current teachers are a nonlinear function of lagged achievement because the tracking equation is bounded 
between 0 and 1. Double-lagged achievement can be described using a nonlinear function of lagged achievement 
because the lagged teachers create discontinuous jumps in lagged achievement that are not in double-lagged 
achievement. The two sources of nonlinearity can be correlated because both depend on lagged achievement. This 
result suggests that the Rothstein falsification test may not work well for VAM. This does not mean that falsification 
tests would not work in other situations. For example, Heckman and Hotz (1989) propose a general falsification test 
Consequently, the linearly based falsification test can suggest implausible current teacher impacts on 
double-lagged achievement in an unbiased VAM. We describe this issue in detail in Appendix A and 
show evidence of this in our simulations. 

This third condition is important because Rothstein (2010) expresses particular interest in the 
distribution of error terms. In particular, he acknowledges that evidence that future teachers are 
statistically significant predictors of past achievement is not itself proof of bias for current teachers. 28 He 
goes on to say that tracking accompanied by negative correlation in the errors across grades (that is, 
$\mathrm{cov}(e_{ig}, e_{i(g-1)}) < 0$) "strongly suggests" bias for current grade teachers. More precisely, he says, 

"A correlation between treatment and some pre-assignment variable X need not indicate 
bias in the estimated treatment effect if X is uncorrelated with the outcome variable of 
interest. But outcomes are typically correlated within individuals over time, so an 
association between treatment and the lagged outcome strongly suggests that the 
treatment is not exogenous with respect to post-treatment outcomes (Rothstein 2010)." 

We agree that, if the errors are correlated (either negatively or positively) and there is tracking, then 
the teacher effects will probably be biased (see Appendix B for more detail on this point) and the 
Rothstein falsification test will reject based on condition 3 (see Appendix A). However, as we show, the 
test will also reject based on conditions 1, 2, or 3 even without negatively correlated errors. Thus, one 
cannot use the test to definitively identify bias caused by negatively correlated errors. Similarly, one also 
cannot use the Rothstein test to definitively identify variables left out of the VAM that might cause bias 



for nonexperimental estimators based on the same underlying concept as Rothstein: that a treatment cannot affect 
past outcomes. They were looking at the impacts of a job training program. Rothstein (2010) cites Ashenfelter 
(1978) on this issue. Ashenfelter was also looking at impacts of a job training program. Job training programs are 
often taken only once every few years. Hence, there may be no analogy to the lagged teacher that would generate 
nonlinearities in the relationship between lagged earnings and double-lagged earnings that are correlated with the 
current training program, like the nonlinearities in the relationship between lagged achievement and double-lagged 
achievement that are likely in VAM. Hence, while our results suggest concern regarding the use of falsification tests 
for a VAM, they do not rule out the use of falsification tests in other situations, such as those considered by 
Heckman and Hotz (1989), and even in other education research in which the goal is to estimate effects of programs 
or policies that target large numbers of classrooms. 

28 More generally, in personal correspondence with others (see Chetty et al. 2011, footnote 53), Rothstein has stated 
that his test is “neither necessary nor sufficient for there to be bias in a VA estimate.” Rather, his test suggests cause 
for concern about bias that might be caused by unobservables. We view our findings as showing conditions under 
which that bias might also be small. Chetty et al. (2011) present non-experimental evidence suggesting small bias 
caused by unobservables, supporting earlier experimental findings by Kane and Staiger (2008). 
(such as double-lagged achievement or lagged teachers) since the test will reject based on condition 3 
even without conditions 1 or 2. 



Simulation Results 

We performed a number of simulations that illustrate the findings reported in the preceding 
section. For consistency with Rothstein (2010), we simulated data for grades 3 through 5. 29 We 
conducted Rothstein falsification tests (described by equation 4 above) and tests for bias for the 
estimated grade 5 teacher effects. 30 The errors in the achievement equations are jointly normally 
distributed and have standard deviations of 0.4. We set the coefficient on lagged achievement to either 
0.91 or 0.95, depending on the model, to keep the achievement level standard deviations close to 1. 31 
The standard deviations of grade 5 teachers were set to 0.1. This is the value Rothstein used for his 
baseline model in Appendix C of his 2010 paper and is in the range of the estimates reported by 
Hanushek and Rivkin (2010). The standard deviations for grades 3 and 4 teachers were set to either 0.1 
or 0, depending on the model. 

We simulated data for 200 schools with four teachers per school and 20 students per teacher 
for a total of 800 teachers and 16,000 students in each model. 

We simulated data setting the school effects to zero but controlled for school effects both in our 
VAMs and in the Rothstein tests. Thus, we estimated effects for three teachers in each school, or a total 
of 600 teachers across the sample. 
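
The sample sizes above follow from simple arithmetic, sketched here in Python. The variable names are ours and purely illustrative; only the counts come from the text.

```python
# Design sizes used in the simulations (counts taken from the text above).
n_schools = 200
teachers_per_school = 4
students_per_teacher = 20

n_teachers = n_schools * teachers_per_school      # all simulated teachers
n_students = n_teachers * students_per_teacher    # all simulated students

# With school effects in the regression, one teacher dummy per school is
# dropped, so effects are estimated for the remaining teachers only.
n_estimated = n_schools * (teachers_per_school - 1)

print(n_teachers, n_students, n_estimated)
```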

We based tracking on various subsets of the following five factors: (1) previous achievement, (2) 
double-lagged achievement, (3) the previous teacher (a dummy variable), (4) a random component that 

29 We started with normally distributed achievement in grade 2 and then added in teacher effects and normally 
distributed errors for achievement in grades 3, 4, and 5. 

30 The grade 5 teachers can be thought of as future teachers in grade 4 using the regular Rothstein test or current 
teachers using our revised test (lagging the Rothstein test one period). As noted earlier, if tracking systems are stable 
across grades, then the choice of grade levels will not matter. 

31 The choice of the coefficient on lagged achievement does not impact our substantive findings as long as it is not 
zero. The achievement scores all have standard deviations between 0.98 and 1.01. 







has a standard deviation of 0.2, and (5) an omitted variable (discussed below) with a standard deviation 
of 0.2. Within schools, we split students into four groups of 20 each based on an indicator variable equal 
to the sum of the tracking factors used. The factors used vary depending on the model, as described in 
Table 1. In some models, the achievement error terms have a negative correlation of around -0.25. 32 
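
One way to picture the tracking step is the following sketch for a single school. It assumes that students are ranked on the tracking index and cut into four equal classes; the text does not spell out the exact splitting rule, so the rank-based cut, the function name `track_school`, and the example inputs are our own illustrative choices.

```python
import random

def track_school(tracking_index, n_classes=4):
    """Assign the students of one school to classes by rank on a tracking index.

    tracking_index holds one value per student (the sum of whichever tracking
    factors a model uses). Returns a class label 0..n_classes-1 per student.
    Assumption: classes are equal-sized slices of the ranked index.
    """
    n = len(tracking_index)
    class_size = n // n_classes
    order = sorted(range(n), key=lambda i: tracking_index[i])
    classes = [0] * n
    for rank, student in enumerate(order):
        classes[student] = min(rank // class_size, n_classes - 1)
    return classes

# Example: track 80 students on lagged achievement plus a random component
# with standard deviation 0.2 (two of the five factors listed above).
rng = random.Random(0)
lagged_achievement = [rng.gauss(0.0, 1.0) for _ in range(80)]
random_component = [rng.gauss(0.0, 0.2) for _ in range(80)]
index = [a + r for a, r in zip(lagged_achievement, random_component)]
classes = track_school(index)
```

Under this rule the lowest-ranked 20 students share a classroom, the next 20 share another, and so on, so any factor entering the index is correlated with classroom assignment.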

Rothstein (2010) finds almost no correlation between errors that are two periods apart. Rather, he finds 
correlations only between errors in contiguous periods. To be consistent with his evidence, we generated errors 
with these properties (correlations between contiguous periods but no correlation across non- 
contiguous periods). More precisely, we generated errors using the following formula: 

$$e_{ig} = w_1 u_{ig} + w_2 u_{i(g-1)}$$

where $w_1$ and $w_2$ are weights chosen to generate errors with the specified variances and 
correlations across grades. The $u_{ig}$ variables are uncorrelated across grades so the errors separated by 
two or more periods are also uncorrelated. 
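
A minimal sketch of how such weights could be chosen is given below. It assumes the $u$ draws are standard normal and independent across grades (our assumption; the text states only that they are uncorrelated), and the function names are hypothetical. The weights solve $w_1^2 + w_2^2 = \sigma^2$ and $w_1 w_2 = \rho\sigma^2$ for an error standard deviation $\sigma$ and adjacent-grade correlation $\rho$.

```python
import math
import random

def ma1_weights(sd=0.4, rho=-0.25):
    """Weights (w1, w2) such that e_g = w1*u_g + w2*u_{g-1}, with unit-variance
    uncorrelated u, has standard deviation sd and adjacent-grade correlation rho.
    Requires |rho| <= 0.5."""
    var = sd ** 2
    total = math.sqrt(var * (1 + 2 * rho))   # equals w1 + w2
    diff = math.sqrt(var * (1 - 2 * rho))    # equals w1 - w2
    return (total + diff) / 2, (total - diff) / 2

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

w1, w2 = ma1_weights()   # roughly 0.386 and -0.104

# Simulate errors for grades 3, 4, and 5 from draws u_2, ..., u_5.
rng = random.Random(1)
e3, e4, e5 = [], [], []
for _ in range(50_000):
    u2, u3, u4, u5 = (rng.gauss(0.0, 1.0) for _ in range(4))
    e3.append(w1 * u3 + w2 * u2)
    e4.append(w1 * u4 + w2 * u3)
    e5.append(w1 * u5 + w2 * u4)

print(correlation(e4, e5))   # close to -0.25 (contiguous grades)
print(correlation(e3, e5))   # close to 0 (two grades apart)
```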

As noted above, some models include an omitted variable that impacts tracking decisions. That 
variable also impacts current achievement scores. It is not correlated with lagged achievement. 

For each model, we tested to see if the model was rejected using the Rothstein falsification test 
and also if the impact estimates for grade 5 teachers were biased. 33 To test for bias, we did a joint test 
of the difference of each of the teacher effect estimates from their true values, which were used to 
simulate the data (allowing for correlations across the estimates). We describe the magnitude of the 
estimated future teacher effects from the Rothstein test using their standard deviations. We show 
correlations between the estimated teacher effects and the true effects as a way of assessing the 
magnitude of the bias. 



32 Rothstein simulates data in Appendix C of his paper using a negative correlation of -0.25. He reports correlations 
of -.21 for math and -.19 for reading for the residuals from the VAM based on the North Carolina data he analyzes. 
These estimates are not adjusted for measurement error. He does report that the correlations are too large to be 
caused by measurement error for his “VAMl” model, which assumes a coefficient of one on lagged achievement. 

33 To estimate the standard deviation of current teacher effect estimates, we use estimates based on equation 1. 
We present five sets of results in Table 1. The first set of columns (under "Results by Condition") 
demonstrates how the Rothstein test performed based on the three conditions discussed above. The 
second set of columns (under "Linear Falsification Test") covers findings when grade 4 achievement was 
linearly associated with grade 3 achievement; these results are useful to show why the Rothstein test 
rejects based on the third condition described above. The third set of columns (under "Negatively 
Correlated Errors") describes conditions under which the Rothstein test worked in the sense that it 
falsified the model when the VAM produced biased teacher effect estimates. The fourth set of columns 
(under "Failing to Falsify") presents cases in which the Rothstein test failed to falsify when it should have 
done so. 34 The last column (under "Rejecting RA") shows that the Rothstein test can reject in spite of 
random assignment (RA) of teachers to classrooms. 

The first four rows show the parameters used to generate the data for each model. Blanks 
indicate zero values. The last six rows of Table 1 show our results. The bias test is a joint test of the 
difference between the estimated teacher effects and the true teacher effects from equation 1 above, 
accounting for the fact that one teacher in each school was dropped. The Rothstein test is a joint test of 
the significance of current teachers for predicting double-lagged achievement controlling for lagged 
achievement (like equation 4 above but with multiple teachers and with school effects). The standard 
deviation of the VAM estimates comes from the VAM results. The standard deviation of the Rothstein 
estimates comes from the coefficients on the teacher dummies produced in the Rothstein test. The raw 
correlation between the true and estimated teacher effects is Raw cor($\beta$, $\hat{\beta}$), and Adjusted cor($\beta$, $\hat{\beta}$) 
is the correlation adjusted for estimation error. 
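
For readers unfamiliar with this kind of adjustment, the sketch below shows the textbook Spearman correction for attenuation in the case where only the estimates are noisy. The paper cites Spearman (1904) and Goldhaber and Hansen (2010b) but does not spell out its exact computation, so the formula, the way reliability is estimated, and all numbers here are illustrative assumptions rather than a reproduction of Table 1.

```python
import math

def adjusted_correlation(raw_corr, reliability):
    """Correct a correlation for estimation error in one of the two variables.

    reliability is the share of the variance of the estimates that reflects
    true variation (signal variance divided by total variance)."""
    return raw_corr / math.sqrt(reliability)

# Hypothetical inputs, not taken from Table 1.
raw = 0.60              # raw correlation between true and estimated effects
sd_estimates = 0.125    # standard deviation of the estimated effects
rms_std_error = 0.075   # root-mean-square standard error of the estimates

# Assumed reliability estimate: one minus noise variance over total variance.
reliability = 1 - (rms_std_error / sd_estimates) ** 2
adjusted = adjusted_correlation(raw, reliability)
print(reliability, adjusted)
```

The lower the reliability of the estimates, the larger the gap between the raw and adjusted correlations.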



34 Rothstein acknowledges in his paper that this is possible. 
Table 1. Simulation Results for Rothstein Falsification Test and Bias Test, by Model 

Column groups: Results by Condition (models 1-3); Linear Falsification Test (models 4-6); 
Negatively Correlated Errors (models 7-9); Failing to Falsify (models 10-11); Rejecting RA (model 12). 

Parameter / Result                             1          2          3          4          5          6          7          8          9          10         11         12
Teachers tracked on                            A4, A3     A4, T4     A4         A4, A3     A4, T4     A4         Rand       A4         A4         ov*_ig     ov*_ig, A4 T4
Cor(e_ig, e_i(g-1)) = Cor(e_i(g-1), e_i(g-2))  0          0          0          0          0          0          -0.25      -0.25      -0.25      0          -0.25      0
Var(ov*_ig)                                    0          0          0          0          0          0          0          0          0          0.2        0.2        0
beta_3, beta_4                                 0.1        0.1        0.1        0          0          0          0.1        0.1        0          0.1        0          0.1
Bias test                                      0.95       0.95       [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   2.02       1.02
Rothstein test                                 8.34       8.91       [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   [illeg.]   0.97       2.01
Std dev VAM estimates                          0.132      0.137      0.137      0.132      0.138      0.129      0.138      0.135      0.133      0.189      0.230      0.133
Std dev Rothstein estimates                    0.331      0.292      0.149      0.311      0.278      0.130      0.129      0.144      0.125      0.133      0.126      0.174
Raw cor(beta, beta-hat)                        0.71       0.76       0.75       0.71       0.74       0.71       0.73       0.73       0.75       0.52       0.42       0.71
Adjusted cor(beta, beta-hat)                   0.99       1.00       1.00       0.98       0.98       0.99       0.97       0.99       1.00       0.60       0.47       0.97


Notes: Blanks indicate zeros. Tracking is based on the sum of the variables indicated in the table and a random error with standard deviation of 
0.2. Rand means this random error was the only one used for tracking. School effects are set to zero. Achievement has a standard deviation 
close to one. The errors $e_{ig}$, $e_{i(g-1)}$, and $e_{i(g-2)}$ are in the achievement equations and have standard deviations of 0.4. The variable $ov^*_{ig}$ affects 
both achievement and tracking in grade 5, but is uncorrelated with previous grade variables. Lambda is 0.91 in all models except for 7, 8, 9, 
and 11, where it is 0.95. The bias test is a joint test of the difference between the estimated and true teacher effects in a standard VAM. The 
Rothstein test is described in the text. The test statistics are F-tests. The cut-point for 5 percent statistical significance given our sample sizes is 
1.11. Test statistics in bold are significant at the 5 percent level. The standard deviation of grade 5 teacher effects is 0.1 in all models. Each 
model has 200 schools, four teachers per school, and 20 students per teacher. The regressions control for school effects so we only estimate 
teacher effects for 600 teachers in each model (three teachers per school and 200 schools). 

Adjusted cor($\beta$, $\hat{\beta}$) is an estimate of the correlation that would be observed between $\beta$ and $\hat{\beta}$ in the absence of any estimation error 
(Spearman 1904; Goldhaber and Hansen 2010b). 
The first three columns show cases in which the Rothstein test rejected but there was no bias, 
relating them to the three conditions discussed above. 35 For the first two conditions, the reason for the 
Rothstein test to reject seems fairly clear— current teachers were selected in part based on double- 
lagged achievement and/or lagged teachers, both of which are components of the error term from the 
Rothstein test. Hence, current teachers are correlated with that error. In the third column, however, 
neither condition was present and yet the Rothstein test still rejected. Here the false rejection was 
caused by the fact that the bivariate relationship between double-lagged achievement and lagged 
achievement is nonlinear, whereas the Rothstein falsification test is linear, as discussed earlier. The 
results in columns 4 through 9 of Table 1 help to illustrate this and Appendix A explains in more detail 
why this is possible. 

One of Rothstein's findings is that the magnitudes of the future teacher effects are quite large. 
We found this result in models without bias. Indeed, even in column 3, when rejection is due only to the 
nonlinearity issue, the estimated future teacher effects from the Rothstein test are still about the same 
size as the estimated current teacher effects from the VAM. 

The fact that the grade 5 future teacher effects are noticeable for conditions 1 and 2 might not 
seem surprising, given that those conditions imply that variables that affected tracking were left out of 
the VAM equation. The fact that the nonlinearities cause such a large future teacher coefficient estimate 
for condition 3 might seem more surprising. One reason for this is that the denominator of the 
coefficient estimate on the grade 5 teacher is the variance in the grade 5 teacher that remains after 
controlling for lagged achievement. If lagged achievement predicts the current teacher well in a linear 
model, then the denominator (residual variance) may be small, resulting in a relatively large grade 5 

35 In Table 1, we present results generated using the Rothstein test described in equation 4 above. We also ran 
another variation of the Rothstein test for all of the models presented in Table 1, which we call the Rothstein 2 test. 
It was also used in his paper and involves adding grade 4 teachers to equation 4. For the first three columns of 
Table 1, the results were similar in the sense that the standard deviations of the estimated future teacher effects were 
always larger than those for the VAM estimates for current teachers. 
teacher coefficient estimate. This means that the "future teacher effects" produced by the Rothstein 
test have magnitudes similar to true teacher effects in models with little or no bias. Thus, their 
magnitudes could be misleading in regard to the magnitude of the bias identified. 

The results presented in the columns under "Linear Falsification Test" are based on data in 
which the baseline (grade 4) scores were linearly related to the grade 3 scores. We did this by setting 
the grade 3 and 4 teacher effects to zero. As explained in Appendix B, this enables the grade 3 and 4 
achievement scores to be linearly related. 

As illustrated in columns 4 and 5 in Table 1 (under "Linear Falsification Test"), the Rothstein test 
still rejects if tracking is based on either grade 3 achievement or grade 4 teachers. 36 However, if neither 
of those conditions holds, then the Rothstein test no longer rejects, as can be seen in column 6. This is 
important because it highlights the fact that the Rothstein test could be used to provide evidence that 
grade 5 students were tracked on either grade 3 (double-lagged) achievement or grade 4 (lagged) 
teachers (conditions 1 and 2) or something correlated with those variables even after controlling for 
grade 4 achievement, if the grade 3 and 4 achievement levels were linearly related. 

In columns 7, 8, and 9 (under "Negatively Correlated Errors"), we present conditions under 
which the Rothstein test did identify bias. In column 7, we generated data with no tracking. Therefore 
there was no bias and the Rothstein test did not reject. This is in spite of the fact that the model had 
negatively correlated errors and lagged teacher effects so that the grade 3 and 4 test scores were no 
longer linearly related. When there was tracking (as in column 8), there was bias and the Rothstein test 
appropriately rejected. However, the amount of bias may have little policy relevance as the correlation 
between the estimated and true teacher effects was close to 1 after adjusting for estimation error. 37 In 

36 The standard deviations of the future teacher effects based on the Rothstein 2 test are larger than the standard 
deviations of the VAM estimates for these models. 

37 We also ran models with much larger teacher effects (standard deviation of 0.75), larger negative correlations (- 
0.90), and both. The adjusted correlations between the estimated and true teacher effects remained at 0.93 and above 
in these models. 
addition, we did not find statistically significant bias here though this was due only to a lack of 
precision. 38 

The results in column 9 help to illustrate why the model used in column 8 resulted in so little 
bias. In particular, in column 9, we show that if the baseline scores and double-lagged scores were 
linearly related (that is, there were no lagged teacher effects), then there would be no bias. Under those 
conditions, the baseline test can control for the negatively correlated errors (as discussed in Appendix 
B). 39 While it is not likely that baseline and double-lagged scores are linearly related, they may be almost 
linearly related given the small magnitude of the lagged teacher effects relative to the overall variance of 
achievement. 

If we knew that the error terms were negatively correlated but were unsure if there was 
tracking, then the Rothstein test would provide evidence of at least some bias. However, given that 
tracking in schools is quite likely, evidence of negatively correlated errors is itself evidence of bias. And, 
as noted earlier, one cannot use the Rothstein test to check for bias caused by negatively correlated 
errors because it is not possible to determine whether the test is rejecting because of the negative 
correlation or because of one of the other conditions specified above. 

In columns 10 and 11 (under "Failing to Falsify"), we present cases in which the Rothstein test 
failed to falsify, but the estimated teacher effects were actually biased. We obtained these results by 
creating an omitted variable that affected both grade 5 achievement and the selection of the grade 5 
teachers, but was uncorrelated with lagged achievement and lagged teachers. Because the variable was 
uncorrelated with lagged achievement, it did not cause the Rothstein test to reject. Column 10 presents 



38 With larger sample sizes, we do find bias. Also, in results available upon request, we simulated data for a model 
very similar to the “baseline” model in Appendix C of the Rothstein paper. We find biased estimates for this model, 
which has negatively correlated errors and lagged teacher effects. The correlations between the estimated and true 
teacher effects remain well over 0.90 after adjusting for estimation error. This model includes 1.2 million 
teachers and measurement error, as well as teacher and school effects that are correlated across grades. 

39 In columns 7, 8, and 9, the Rothstein 2 test produces future teacher effect estimates with standard deviations that 
are all larger than those presented in Table 1 for the regular Rothstein test. 
results with uncorrelated errors. In column 11, we show estimates for a model with negatively 
correlated errors that also did not cause the Rothstein test to reject. 40 

Random assignment of individual students to teachers is a clear case in which VAMs yield 
unbiased estimated teacher effects, and the Rothstein test appropriately fails to falsify. Interestingly, 
however, random assignment of groups of students to teachers can cause the Rothstein test to falsify if 
students were tracked into those groups. This is shown in the last column of Table 1, which reports 
findings generated when students were tracked into classrooms based on their previous classrooms 
alone, but teacher assignment to classrooms was random. 41 The same arguments for the Rothstein test 
rejecting hold as for the previous models. Indeed, the first three columns of Table 1 are relevant 
because they simply describe how students were tracked into grade 5 classrooms, but not how grade 5 
teachers were assigned to those classrooms. 42 Thus, they would hold if teachers were randomly 
assigned. 

Conclusion 

As we noted at the outset of this paper, Rothstein's critique of value-added methods used to 
estimate teacher effectiveness has been cited by both research and policymaking communities as a 
reason to doubt the wisdom of using VAMs for high-stakes purposes. The findings we present here, 
however, call into question whether the Rothstein falsification approach provides accurate guidance 
regarding the bias of teacher effect estimates. 

Ideally, the Rothstein test could be used to identify VAMs that produce biased estimates of 
current teacher effects. Our results suggest this is not possible. Moreover, we find that one cannot use 

40 The standard deviations for the Rothstein 2 teacher effects are similar to those for the Rothstein test presented in 
Table 1. 

41 The standard deviation of future teacher effects from the Rothstein 2 test was only 0.075 in this case. 

42 In results available upon request, we ran simulations that align with each of those presented in Table 1, but using 
50,000 students per teacher and only two teachers and one school each. The results were generally similar to those in 
Table 1. An important exception is that we did find bias for column 8 (as expected) in that model. 
the Rothstein test to reject the hypothesis that students were effectively randomly assigned conditional 
on lagged achievement. 

We would argue that Rothstein's 2010 paper raised important concerns about the ability of 
VAMs to produce unbiased estimates of teacher effectiveness, but the Rothstein test itself (just one part 
of his paper) does not provide useful guidance regarding VAMs. Given this, we believe that more work 
needs to be done to understand the potential for bias in VAMs. This will likely involve more investigation 
into the factors that might cause such bias. One way to approach this topic is to look at the various 
factors affecting student sorting into classrooms other than those typically included as controls in VAMs 
so that such variables could be added to future models. 43 Work like this has been done by numerous 
authors both quantitatively (Jacob and Lefgren 2007) and qualitatively (Kraemer et al. 2011). When 
doing such work, however, researchers may want to keep in mind results of Rothstein (2009, 2010) that 
suggest that some variables often omitted from VAMs (for example double-lagged achievement) may 
affect the selection of current teachers and yet not cause much bias. 

From a policy perspective, the important question may be not whether there is any bias but 
how large that bias is. It is quite likely that teacher effectiveness estimates generated from VAMs 
are biased to some degree but, as shown in Rothstein (2009), Kinsler (2011), and our simulations, the 
magnitude of bias may be relatively inconsequential. Decisions about using VAMs should consider how 
this bias compares to potential information that value-added models can provide about teacher 
effectiveness over, or in addition to, other means of assessment. 44 



43 Ashenfelter (1978) makes a similar point in his paper, which looks at similar issues outside of value-added 
models. 

44 Value-added estimates may also be very imprecise (Schochet and Chiang 2010). Other measures of teacher 
effectiveness may also be imprecise, so the policy focus should probably be on how best to obtain more precise 
estimates of teacher performance. This may involve using some combination of VAM and non-VAM measures. 
Indeed, if well implemented such combinations may be optimal both for reducing bias and for improving precision. 
APPENDIX A 



WHEN FALSIFICATION TESTS FAIL 

In this appendix, we elaborate in more detail why the Rothstein test can incorrectly falsify 
correctly specified VAMs. As discussed in the main body of this paper, the omission of variables 
used for tracking students need not cause bias if they do not impact current test scores directly, 
although the existence of such omitted variables in combination with negatively correlated error 
terms would suggest bias. Conditions 1 and 2 suggest that the Rothstein test could be used to 
help identify the existence of such omitted variables — in particular, double-lagged achievement 
and lagged teachers. Here, however, we illustrate that the Rothstein test cannot be used to rule 
out the possibility that students were tracked randomly conditional on lagged achievement. To do 
this, we use a one-school, two-teacher example in which tracking decisions depend only on 
lagged achievement and a random error, and there are no omitted variables from the model. We 
start with an equation for $R_{1g}$, the coefficient on the better (that is, more effective) “future” 
teacher from the Rothstein test (see footnote 22). 

$$R_{1g} = \frac{\mathrm{cov}(T_{1ig}, A_{i(g-2)} \mid A_{i(g-1)})}{\mathrm{var}(T_{1ig} \mid A_{i(g-1)})} = \frac{\mathrm{cov}(T^*_{1ig}, A^*_{i(g-2)})}{\mathrm{var}(T^*_{1ig})}$$

where $A^*_{i(g-2)}$ and $T^*_{1ig}$ are the residuals that result from regressing $A_{i(g-2)}$ and $T_{1ig}$ on $A_{i(g-1)}$. 

As shown in the main text, one component of $\mathrm{cov}(T^*_{1ig}, A^*_{i(g-2)})$ is $\mathrm{cov}(T^*_{1ig}, e^*_{i(g-1)})$. If this 
component is non-zero, then the Rothstein test can reject incorrectly (condition 3). 

Condition 3 would not occur if lagged achievement and its error term were jointly normally 
distributed. In this situation, the expected value of the error would be a linear function of lagged 
achievement and the residual error would be uncorrelated with any function of lagged 
achievement (linear or nonlinear). 45 But, as we show below, it is quite likely that lagged 
achievement is not normally distributed because it is itself impacted by lagged teachers. The 
impacts of these teachers depend on the fraction of time a student is assigned to a teacher, which 
is necessarily bounded between zero and one, and therefore is not normally distributed. This 
means that the expected value of the lagged error (one element in the equation for condition 3) is 
likely to be a nonlinear function of lagged achievement. The current teacher (the other element of 
condition 3) is also a nonlinear function of lagged achievement because assignment to the current 
teacher is also a non-normally distributed variable, that is, students either are, or are not, 
assigned to a given teacher. Since both the lagged error and current teacher are nonlinear 
functions of lagged achievement, they can remain correlated even after conditioning on lagged 
achievement in a linear regression. 
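
The argument can be illustrated with a small one-school, two-teacher simulation. This is a sketch under our own assumptions, not the authors' simulation code: the lagged teacher is assigned at random, students are tracked to the current teacher by a deterministic cut at the median of lagged achievement, and the lagged teacher effect is set to an exaggerated 1.0 (rather than the 0.1 used in the paper) so that the nonlinearity is visible in a modest sample. The function names are hypothetical.

```python
import random

def ols_two_regressors(y, x1, x2):
    """Slopes (b1, b2) from regressing y on a constant, x1, and x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def future_teacher_coef(lagged_teacher_effect, n=100_000, seed=0):
    """Coefficient on the current teacher when double-lagged achievement is
    regressed on lagged achievement and the current-teacher dummy."""
    rng = random.Random(seed)
    lam, sd_err = 0.9, 0.4
    double_lagged, lagged = [], []
    for _ in range(n):
        start = rng.gauss(0.0, 1.0)                   # double-lagged achievement
        t_lag = 1.0 if rng.random() < 0.5 else 0.0    # lagged teacher, random
        double_lagged.append(start)
        lagged.append(lam * start + lagged_teacher_effect * t_lag
                      + rng.gauss(0.0, sd_err))
    cutoff = sorted(lagged)[n // 2]                   # track on lagged score only
    current_teacher = [1.0 if a > cutoff else 0.0 for a in lagged]
    _, coef = ols_two_regressors(double_lagged, lagged, current_teacher)
    return coef

coef_no_lag_effect = future_teacher_coef(0.0)    # lagged achievement is normal
coef_with_lag_effect = future_teacher_coef(1.0)  # lagged achievement is a mixture
print(coef_no_lag_effect, coef_with_lag_effect)
```

In this sketch the "future teacher" coefficient is close to zero when lagged teachers have no effect, and clearly nonzero when they do, even though tracking depends only on lagged achievement.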

To show more formally how Rothstein’s test can reject because of 
condition 3, we make the simplifying assumption that the lagged error is jointly normally 
distributed with double-lagged achievement. 46 Given this joint normality assumption, we know 



45 If the current teacher is a function of lagged achievement and an uncorrelated error, its residual (controlling for 
lagged achievement) would also be uncorrelated with the lagged error. 

46 We use this example to show how condition 3 might hold because neither of the variables that appear in 
conditions 1 or 2 affect tracking in this example (double-lagged achievement or the lagged teacher). However, it is 
that for either lagged teacher’s students, the expected value of $A_{i(g-2)}$ is a linear function of $A_{i(g-1)}$. 
Let these functions be: 

$$E(A_{i(g-2)} \mid T_{1i(g-1)} = 1, A_{i(g-1)}) = \alpha_1 + \lambda_1 A_{i(g-1)}$$

for students with the better lagged teacher, and 

$$E(A_{i(g-2)} \mid T_{1i(g-1)} = 0, A_{i(g-1)}) = \alpha_0 + \lambda_0 A_{i(g-1)}$$

for students with the omitted lagged teacher. 

Now consider the function for the probability of having the more effective lagged teacher as 
a function of lagged achievement. This is nonlinear since $T_{1i(g-1)}$ is a discrete variable. Thus, 

$$E(T_{1i(g-1)} \mid A_{i(g-1)}) = T_1(A_{i(g-1)})$$

Note that this is not a tracking function because we are describing a function that relates the 
lagged teacher to the lagged test score, and not to the double-lagged test score. 

Finally, the equation for the expected value of $A_{i(g-2)}$ as a function of $A_{i(g-1)}$ that combines 
both sets of students can be written as follows: 

$$E(A_{i(g-2)} \mid A_{i(g-1)}) = (\alpha_1 + \lambda_1 A_{i(g-1)})\,T_1(A_{i(g-1)}) + (\alpha_0 + \lambda_0 A_{i(g-1)})\,T_0(A_{i(g-1)})$$

where $T_0(A_{i(g-1)}) = 1 - T_1(A_{i(g-1)})$. 

$T_0(\cdot)$ and $T_1(\cdot)$ are both nonlinear functions of $A_{i(g-1)}$. Thus, $E(A_{i(g-2)} \mid A_{i(g-1)})$ is a nonlinear 
function of $A_{i(g-1)}$. Since both $E(A_{i(g-2)} \mid A_{i(g-1)})$ and $T_{1ig}$ are nonlinear functions of $A_{i(g-1)}$, this 
suggests that they could be correlated even after controlling for $A_{i(g-1)}$ linearly. This can, in turn, 
cause the Rothstein test to reject incorrectly. 47 



also true that conditions 1 and 2 might hold in this situation. This can happen because of nonlinearities similar to 
those discussed here. 

47 Allowing $A_{i(g-2)}$ to be non-normal could introduce additional nonlinearities that could also cause the Rothstein test 
to reject incorrectly. 
APPENDIX B 



NEGATIVELY CORRELATED ERRORS NEED NOT CAUSE BIAS 

In this appendix, we show that negatively correlated errors need not cause bias if lagged 
achievement scores are normally distributed. To make this point, we first show that negatively 
correlated errors may result in an omitted variable that is a linear function of the lagged 
achievement score. An omitted variable with this property will bias the lagged achievement 
coefficient estimate, but will not cause bias in the teacher effects, a point that is well known. This 
is important because it means that the negative correlation in errors across grades assumed by 
Rothstein need not cause bias on its own. 

To investigate these issues, we consider a model in which the error terms are negatively 
correlated, as Rothstein posits. This could happen if students who had an above-average error 
term in the previous period forget more than students with an average error term in direct 
proportion to how far above average they were in the previous period. Similarly, those who had a 
below-average error term in the previous period learn more than students with an average error 
term (perhaps from other students or their teacher), again, in direct proportion to how far they 
were below average in the previous period. 48 Mathematically, this can be written as: 

$$e_{ig} = w_1 u_{ig} + w_2 u_{i(g-1)}$$

where $\mathrm{cov}(u_{ig}, u_{i(g-1)}) = 0$ and $w_2 < 0$. 
This implies that 

(B.1) 49 $\mathrm{cov}(e_{ig}, e_{i(g-1)}) = w_2\,\mathrm{var}(u_{i(g-1)}) < 0$. 
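
As a check on (B.1), the covariance can be worked out directly from the moving-average form of the errors, assuming only that the $u$ terms are uncorrelated across grades. The general-weight expression below is our own restatement, not a formula from the text.

```latex
\begin{aligned}
\mathrm{cov}(e_{ig}, e_{i(g-1)})
  &= \mathrm{cov}\big(w_1 u_{ig} + w_2 u_{i(g-1)},\; w_1 u_{i(g-1)} + w_2 u_{i(g-2)}\big) \\
  &= w_1 w_2\,\mathrm{var}(u_{i(g-1)}), \\
\mathrm{cov}(e_{ig}, e_{i(g-2)})
  &= \mathrm{cov}\big(w_1 u_{ig} + w_2 u_{i(g-1)},\; w_1 u_{i(g-2)} + w_2 u_{i(g-3)}\big) = 0 .
\end{aligned}
```

With the weight on the current-period term normalized to $w_1 = 1$, as in the derivation in footnote 49, the first expression reduces to $w_2\,\mathrm{var}(u_{i(g-1)})$, which is negative whenever $w_2 < 0$. The second line shows why errors two grades apart remain uncorrelated.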



As in our simulations, we assume that lagged achievement depends only on lagged teachers, 
an error term, and a random starting point ($A_{i(g-2)}$) that is normally distributed. 50 Thus: 

$$A_{ig} = A_{i(g-1)}\lambda + T_{1ig}\beta_{1g} + e_{ig}$$

$$A_{i(g-1)} = A_{i(g-2)}\lambda + T_{1i(g-1)}\beta_{1(g-1)} + e_{i(g-1)}$$

To generate normally distributed baseline scores (and unbiased teacher effect 
estimates), we assume that lagged teachers have no impact on lagged achievement levels. Thus, 



48 This is sometimes described as “regression to the mean,” although that phrase is sometimes used to describe 
situations in which the true errors are uncorrelated across grades. 

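49 In the following five lines, we derive equation (B.1). 

$$\begin{aligned}
&\mathrm{cov}(e_{ig}, e_{i(g-1)}) \\
&= \mathrm{cov}(\rho e_{iu(g-1)} + e_{iug},\; \rho e_{iu(g-2)} + e_{iu(g-1)}) \\
&= \rho^2\,\mathrm{cov}(e_{iu(g-1)}, e_{iu(g-2)}) + \rho\,\mathrm{cov}(e_{iu(g-1)}, e_{iu(g-1)}) + \rho\,\mathrm{cov}(e_{iug}, e_{iu(g-2)}) + \mathrm{cov}(e_{iug}, e_{iu(g-1)}) \\
&= \rho^2 \cdot 0 + \rho\,\mathrm{var}(e_{iu(g-1)}) + \rho \cdot 0 + 0 \\
&= \rho\,\mathrm{var}(e_{iu(g-1)}) < 0.
\end{aligned}$$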

50 This can be justified if we think of two grade levels in the past as the grade when the child entered school and that 
there was no tracking in that grade. All subsequent learning is captured by later teacher effects and error terms. 
$$A_{i(g-1)} = A_{i(g-2)}\lambda + e_{i(g-1)}$$

where $A_{i(g-2)}$, $e_{i(g-1)}$, and $e_{ig}$ are jointly normal and, apart from the correlation between the two error terms in (B.1), uncorrelated with each other by assumption. 
This implies that $A_{i(g-1)}$ and $e_{ig}$ are also jointly normal and linearly related because a linear 
function of two jointly normally distributed variables is also jointly normal and linearly 
associated with each of those variables (Goldberger 1991; Theil 1957). Indeed, all four variables 
are jointly normal and linearly related. 

$$\begin{pmatrix} A_{i(g-1)} \\ A_{i(g-2)} \\ e_{ig} \\ e_{i(g-1)} \end{pmatrix}$$




In particular, the expected value of <? lg is a linear function of Ai( g _i). Given this, let ei\ be the 
residual from a regression of e lg on Ai( g .i). 

ei g — yAi( g _i)+eri g 

where y is the coefficient on lagged achievement. 51 

Now we need to show that er X g is unconelated with T; g , the current teacher. We start by 
assuming a specific functional form for Tti g , 

(B.2) Ti,i jg = 1 if A ( i jg _i)>0 and 0 otherwise. 

We then show that er xg is uncorrelated with any function of Ai (g _i) and therefore is 
uncorrelated with Ti >gg . 

By assumption, e xg and A;( g _i > are jointly nonnal. By construction, er xg is a linear function of 
these variables equal to <?j g -/lAj g . Thus er xg and Aj( g _i)are also jointly nonnal. By construction, er xg 
is also uncorrelated with Ai (g _i). Joint nonnality and zero conelation implies independence. This, 
in turn, means that er xg is unconelated with any function of Ai( g _i) regardless of whether it is 
linear or nonlinear. Since r t i & is a function of Aj (g _i), it is also unconelated with eri g . 52 

Using the symbols from equation 2 in the main body of this paper (the omitted variable bias
formula), this means that $\mathrm{cov}(e^*_{ig}, \tau^*_{i1g}) = 0$. To see this, note that $er_{ig}$ is the residual that remains
after regressing $e_{ig}$ on $A_{i(g-1)}$. Thus, $er_{ig}$ is the same as $e^*_{ig}$. Similarly, $\tau^*_{i1g}$ is the residual that
remains after regressing $\tau_{i1g}$ on $A_{i(g-1)}$. If $er_{ig}$ is uncorrelated with $\tau_{i1g}$ then it will also be
uncorrelated with $\tau^*_{i1g}$ because $\tau^*_{i1g}$ is a linear function of $\tau_{i1g}$ and $A_{i(g-1)}$ and $er_{ig}$ is uncorrelated
with both of those variables.
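
The independence argument can also be checked numerically. The sketch below is illustrative only and is not the simulation code used in this paper. It assumes a first-order moving-average error process with a hypothetical $\rho = -0.4$, a hypothetical coefficient of 0.8 on twice-lagged achievement, unit-variance normal draws, and the threshold rule in (B.2). All function and variable names are invented for the example.

```python
import random
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def corr(xs, ys):
    return cov(xs, ys) / sqrt(cov(xs, xs) * cov(ys, ys))

def simulate(n=100_000, rho=-0.4, lam=0.8, seed=0):
    """Return (gamma, corr(er, A_lag), corr(er, tau), corr(e, tau))."""
    rng = random.Random(seed)
    a_lag, e_cur, tau = [], [], []
    for _ in range(n):
        a_lag2 = rng.gauss(0, 1)                       # achievement two grades back
        eps_lag2, eps_lag, eps_cur = (rng.gauss(0, 1) for _ in range(3))
        e_lag = rho * eps_lag2 + eps_lag               # assumed MA(1) error, grade g-1
        e = rho * eps_lag + eps_cur                    # assumed MA(1) error, grade g
        a = lam * a_lag2 + e_lag                       # lagged achievement
        a_lag.append(a)
        e_cur.append(e)
        tau.append(1.0 if a > 0 else 0.0)              # tracking rule (B.2)
    # Regress the current error on lagged achievement and keep the residual.
    gamma = cov(e_cur, a_lag) / cov(a_lag, a_lag)
    m_e, m_a = mean(e_cur), mean(a_lag)
    er = [(e - m_e) - gamma * (a - m_a) for e, a in zip(e_cur, a_lag)]
    return gamma, corr(er, a_lag), corr(er, tau), corr(e_cur, tau)
```

Under these assumed parameters, the raw error is negatively correlated with the teacher dummy, while the residual should be uncorrelated with both lagged achievement and the teacher dummy, up to sampling noise.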



51 We expect $\gamma$ to be less than 0 since $\mathrm{cov}(e_{ig}, e_{i(g-1)})$ is less than 0.

52 In a more realistic scenario, $\tau_{i1g}$ would also depend on some additional variables. As long as they are also
distributed independently of $er_{ig}$, then $er_{ig}$ will remain conditionally uncorrelated with $\tau_{i1g}$.

53 In the main body of this paper, we discussed creating residuals by regressing each variable on lagged achievement 
and the other teacher dummies. In this model, there are no other teachers because there are only two teachers and 
one is omitted. 
The negative correlation in errors means that students who scored lowest on the previous
test will score somewhat higher than otherwise expected in the current period and vice versa. The
coefficient estimate on $A_{i(g-1)}$ will be biased downwards, but, in this case, the negative
correlation has no impact on the coefficient on $\tau_{i1g}$.
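
A small regression exercise illustrates both claims. This is a minimal sketch under assumed values, not the paper's simulation design: the true coefficient on lagged achievement (0.9), the true teacher effect (0.5), the moving-average parameter (-0.4), and the coefficient on twice-lagged achievement (0.8) are all hypothetical, as are the function names.

```python
import random

def ols_two_regressors(y, x1, x2):
    """OLS slopes of y on x1 and x2 (with an intercept), via the 2x2 normal equations."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def run_vam(n=100_000, alpha=0.9, beta=0.5, rho=-0.4, lam=0.8, seed=1):
    """Simulate tracked students and return (alpha_hat, beta_hat)."""
    rng = random.Random(seed)
    a_cur, a_lag, tau = [], [], []
    for _ in range(n):
        eps_lag2, eps_lag, eps_cur = (rng.gauss(0, 1) for _ in range(3))
        e_lag = rho * eps_lag2 + eps_lag        # assumed MA(1) error, grade g-1
        e = rho * eps_lag + eps_cur             # negatively correlated with e_lag
        a1 = lam * rng.gauss(0, 1) + e_lag      # lagged achievement
        t = 1.0 if a1 > 0 else 0.0              # tracking on lagged achievement
        a_cur.append(alpha * a1 + beta * t + e) # current achievement
        a_lag.append(a1)
        tau.append(t)
    return ols_two_regressors(a_cur, a_lag, tau)
```

With these assumed values, the estimated coefficient on lagged achievement should fall below 0.9 by roughly the (negative) regression coefficient of the current error on lagged achievement, while the estimated teacher effect should stay close to 0.5.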

Some readers might also be concerned about the plausibility of the data generation process 
we propose for tracking because it appears to depend on a latent variable that is a linear function 
of lagged achievement. However, as noted above, tracking is necessarily a nonlinear function of 
this latent variable. The functional form we propose allows for this. More precisely, one could 
think of tracking as a two-stage system in which tracking depends on a latent variable that, in
turn, depends on lagged achievement. This can be written either by stage or in a single stage by 
substituting out for the latent variable (LV). Thus, 



Stage 1: $LV = \beta_{\tau^*} A_{i(g-1)}$

Stage 2: $\tau_{i1g} = T(LV)$

Combined: $\tau_{i1g} = T(\beta_{\tau^*} A_{i(g-1)})$

where $\beta_{\tau^*}$ is the coefficient on lagged achievement in Stage 1.

We have assumed that the first stage is linear. However, even if the first stage were 
nonlinear this would not affect our argument because the second stage is nonlinear. Thus, 
nonlinearity in the first stage is not an issue. What is key to our argument for this appendix — that 
a “plausible” data generation process can yield unbiased results — is that the relationship between 
the current error term and lagged achievement is linear. 
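
The two ways of writing the tracking rule can be sketched in a few lines. This is only an illustration: the positive Stage 1 coefficient (1.5), the cubic alternative for Stage 1, and the threshold form of the Stage 2 function are assumptions chosen for the example, and the function names are hypothetical.

```python
def stage1(lagged_achievement, beta=1.5):
    """Stage 1: latent variable, linear in lagged achievement (beta is hypothetical)."""
    return beta * lagged_achievement

def stage1_nonlinear(lagged_achievement):
    """A hypothetical nonlinear alternative for Stage 1."""
    return lagged_achievement ** 3

def stage2(latent):
    """Stage 2: nonlinear threshold rule, in the spirit of (B.2)."""
    return 1 if latent > 0 else 0

def combined(lagged_achievement, beta=1.5):
    """Single-stage form obtained by substituting out the latent variable."""
    return 1 if beta * lagged_achievement > 0 else 0

scores = [-2.0, -0.3, 0.0, 0.4, 1.7]                        # made-up lagged scores
by_stage = [stage2(stage1(a)) for a in scores]              # two-stage form
single = [combined(a) for a in scores]                      # substituted form
nonlinear = [stage2(stage1_nonlinear(a)) for a in scores]   # nonlinear Stage 1
```

In every variant the teacher assignment is a deterministic function of lagged achievement alone, which is the only property the independence argument relies on.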